Introduction

You’ve probably all heard of, or are part of, the current drive to create reproducible pipelines within the department. Some of you might be very familiar with them; others may just be using methods and ideas from them as part of your regular work.

A little before Christmas, I started putting together a proof of concept within my team, based on a presentation of our shadow measures reforms from ages ago. The goal was partly to upskill myself and apply more efficient ways of working in R. I also wanted to act as a bit of a pathfinder: to experiment with how all the ideas I’d heard in various Coffee and Coding sessions - projects, functions, connecting to SQL and so on - could be joined up, such that someone coming at the work fresh could regenerate everything I had done (or repeat it with a new dataset) within minutes.

This proof of concept joins together many of the concepts you’ve heard about previously into one logical workflow, from setting out a project structure all the way through to outputting a final presentation file.

Everything that transpires does so according to my design…

How many times have we come across something like this when using Excel/SQL together in our teams?

You’ve got three different file types, with analysis and source code spread across more than a year and written by three different authors. Sure, we could spend a while figuring out which code came first and which version is the original - maybe using the date modified, or a version or year written into the filename? Not ideal…

When I first started working with R, an R script meant I could finally join disparate bits of analysis together, and connecting to SQL meant I could relate figures to the chunks of code and tables that produced them. Great! But wait: it becomes one ugly, very long, convoluted script, with a lot of scrolling and an ever-increasing risk of repetitive strain injury.

Enter stage left -> R projects. An R project is a self-contained, reproducible project set out the way you want it. We’ve had a previous Coffee and Coding run-through on R project workflow (found here), so I won’t go into excessive detail. Suffice it to say, it’s great for keeping track because:

  • The integrity of your data is ensured
  • You can relate the correct figures and tables to the exact code used to generate them
  • It makes it easy to share your code with someone else
  • You can add in gooey Git for version control and easier QA

You can set it up so easily -

  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in the name of the directory to store your project, e.g. “my_project”.
  5. If available, select the checkbox for “Create a git repository.”
  6. Click the “Create Project” button.
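As an aside, the same setup can be scripted rather than clicked through. This sketch uses the usethis package (an alternative to the menu steps above, not how I originally set things up):

```r
# install.packages("usethis")   # one-off, if not already installed
library(usethis)

create_project("my_project")  # creates the directory and the .Rproj file
use_git()                     # initialises a git repository (step 5 above)
```

Either route gets you to the same place; the scripted version is handy if you ever need to stamp out the same skeleton for a new piece of work.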

A few best practices for project organisation I’ve picked up along the way:

  • Treat data as read-only - if you work with data interactively (as in Excel), where it can be modified, you can never be sure where the data came from.
  • Separate data cleaning - since I was reusing data already cleaned and stored in SQL, I didn’t have to do this. However, it’s a good idea to separate out data-cleaning scripts and create a second “read-only” folder to hold the cleaned data sets, so we don’t confuse them with the “dirty” source data when working from Excel files.
  • Treat generated output as disposable - anything generated should be reproducible from the scripts. Having an output folder with sub-directories for each separate bit of analysis makes things easier later.
  • Effective naming - it goes without saying, but name all files to reflect their content or function.

Taken together, it creates something that looks a bit like this. I have a script for general setup, loading packages like ggplot2 and reading data in; a custom functions script covering functions I’ve created for this piece of work; sequentially numbered scripts that run from data load through our various measures and some additional analysis; and a Master script that brings everything together later on, ready for input into a presentation file. An Outputs folder, broken out by source script/measure, contains any generated charts and images.
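One practical note: output folders have to exist before ggsave() can write into them, so it’s worth creating the skeleton up front. A sketch (the folder names here are illustrative, not the exact ones from my project):

```r
# One-off setup: create the project skeleton so later scripts can
# assume the folders exist. showWarnings = FALSE makes it safe to re-run.
dirs <- c("Data",
          "Outputs/Retention",
          "Outputs/Attainment",
          "Functions")
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
```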

Yo dawg, I heard you liked functions, so I put functions in your functions

On opening the R project file, all the requisite scripts open sequentially, from the functions script on the left to the presentation file at the end, with a preview visible.

A consequence of having six different cohorts to report on was a large degree of repetition in the charts and analysis produced. When my project was in its early stages, the initial scripts for Attainment, Retention and MLM analysis contained many lines of directly executed code.

As you might expect, when I went back through the code to make minor changes, or where something hadn’t quite worked, it meant trawling through long stretches of code to correct what might be just one line.

Enter stage right <- functions.

Like Inception, we can go deeper. You can nest your own functions and calls to public packages such that entering a couple of keywords generates a beautiful final piece of analysis, which you can point at alternative datasets again and again and again. I’ve taken a couple of examples from my Retention code: a basic plotter that just generates a single histogram but is repetitive, and a more complex function that combines analysis from three separate measures for four different cohorts and overlays national values onto one final chart.

Simple charts

The code here produces a bog-standard histogram for each cohort. Since we’re producing essentially the same chart over and over again, I decided it would be easier to break it out with some simple if statements. I set up the labelling and colours at the beginning of the function so I don’t need to worry about them again, and depending on which cohort I enter, it sets the title accordingly, all consistently formatted.

#### Retention - cohort Distribution ####
# Plots the distribution of retention rate by cohort for institutions.

# plot_name = Character string, name of output plot in global environment.

# df = Name of data frame variable containing Retention data. Must have 
  # Inscope and PT_Retained variable fields for cohorts.

# cohort = Character string, selected cohort for plotting. Can take the values:
  # "acad", "alev", "agen", "tlev", "l2voc", "techcert" as strings.
  
# Cohort variable takes the values: acad, alev, agen, tlev, techcert, & l2voc when
  # setting up output chart characteristics. 
# Data variable must have Retention data (e.g. Inscope_ACAD, PT_retained) to work.
ret_cohortdist <- function(plot_name, df, cohort){
  
# Formatting and setup for binning
  breaks_retention = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 
                       45, 50, 55, 60, 65, 70, 75, 80, 
                       85, 90, 95, 100, 100.5) 
  labels_retention = c("0%", "", "10%", "", "20%", "", "30%", "", "40%", 
                       "", "50%", "", "60%", "", "70%", "", "80%", "", 
                       "90%", "", "100%")
  #levels_retention = c("0%", "5", "10%", "15", "20%", "25", "30%", "35", "40%", 
  #                     "45", "50%", "55", "60%", "65", "70%", "75", "80%", "85", 
  #                     "90%", "95", "100%") # Needed for where we combine two groups for shifting bars correctly.
  
  retention_colours = c("purple3", "sienna2", "palegreen3", "steelblue3", "blue4", "firebrick3") 
  
  
# Produce cohort plot
  if(cohort == "acad"){
      output <-  transform(as.data.frame(table(cut(df %>%
                                                     filter(T_INSCOPE_ACAD_RET != 0) %>% 
                                                     .$PT_RETAINED_ACAD_RET, 
                                                   breaks= breaks_retention, 
                                                   include.lowest = TRUE, 
                                                   right = FALSE))), # forces cut to ignore 
                           # values at top e.g. 95-100, counts to 99.9
                           Percentage = round(prop.table(Freq), 2)) %>% 
       ggplot(., aes(x = Var1, y = Percentage)) +
       geom_bar(stat = "identity", 
                position = position_nudge(x = 0.5), # position_nudge moves bars right
                
        # Formatting         
                fill =  retention_colours[1]
                ) + 
        scale_x_discrete("Retention Rate", 
                         labels = labels_retention) + 
        scale_y_continuous(name = "Percentage of schools & colleges", 
                           limits = c(0, 0.6), 
                           labels = scales::percent) +  
        theme_hc() + 
        theme(axis.text.x = element_text(angle = 65, hjust = 1), 
              text = element_text(size = 15)) + 
        
        # Annotation
        annotate("text", x = 4, y = 0.55, 
                 label = "Academic", 
                 size = 5)
    } 
  if(cohort == "alev"){
    output <- transform(as.data.frame(table(cut(df %>%
                                                  filter(T_INSCOPE_ALEV_RET != 0) %>% 
                                                  .$PT_RETAINED_ALEV_RET, 
                                                breaks= breaks_retention, 
                                                include.lowest = TRUE, right = FALSE))), 
                        Percentage = round(prop.table(Freq), 2))%>% 
      ggplot(., aes(x = Var1, y = Percentage)) +
      geom_bar(stat = "identity", 
               position = position_nudge(x = 0.5), # position_nudge moves bars right
               
               # Formatting         
               fill =  retention_colours[2]
      ) + 
      scale_x_discrete("Retention Rate", 
                       labels = labels_retention) + 
      scale_y_continuous(name = "Percentage of schools & colleges", 
                         limits = c(0, 0.6), 
                         labels = scales::percent) +  
      theme_hc() + 
      theme(axis.text.x = element_text(angle = 65, hjust = 1), 
            text = element_text(size = 15)) + 
      
      # Annotation
      annotate("text", x = 4, y = 0.55, 
               label = "A level", 
               size = 5)
  } 
  if(cohort == "tlev"){
    output <-  transform(as.data.frame(table(cut(df %>%
                                                   filter(T_INSCOPE_TLEV_RET != 0) %>% 
                                                   .$PT_RETAINED_TLEV_RET, 
                                                 breaks= breaks_retention, 
                                                 include.lowest = TRUE, 
                                                 right = FALSE))), # forces cut to ignore 
                         # values at top e.g. 95-100, counts to 99.9
                         Percentage = round(prop.table(Freq), 2)) %>% 
      ggplot(., aes(x = Var1, y = Percentage)) +
      geom_bar(stat = "identity", 
               position = position_nudge(x = 0.5), # position_nudge moves bars right
               
               # Formatting         
               fill =  retention_colours[3]
      ) + 
      scale_x_discrete("Retention Rate", 
                       labels = labels_retention) + 
      scale_y_continuous(name = "Percentage of schools & colleges", 
                         limits = c(0, 0.6), 
                         labels = scales::percent) +  
      theme_hc() + 
      theme(axis.text.x = element_text(angle = 65, hjust = 1), 
            text = element_text(size = 15)) + 
      
      # Annotation
      annotate("text", x = 4, y = 0.55, 
               label = "Tech level", 
               size = 5)
  } 
  if(cohort == "agen"){
    output <-  transform(as.data.frame(table(cut(df %>%
                                                   filter(T_INSCOPE_AGEN_RET != 0) %>% 
                                                   .$PT_RETAINED_AGEN_RET, 
                                                 breaks= breaks_retention, 
                                                 include.lowest = TRUE, 
                                                 right = FALSE))), # forces cut to ignore 
                         # values at top e.g. 95-100, counts to 99.9
                         Percentage = round(prop.table(Freq), 2)) %>% 
      ggplot(., aes(x = Var1, y = Percentage)) +
      geom_bar(stat = "identity", 
               position = position_nudge(x = 0.5), # position_nudge moves bars right
               
               # Formatting         
               fill =  retention_colours[4]
      ) + 
      scale_x_discrete("Retention Rate", 
                       labels = labels_retention) + 
      scale_y_continuous(name = "Percentage of schools & colleges", 
                         limits = c(0, 0.6), 
                         labels = scales::percent) +  
      theme_hc() + 
      theme(axis.text.x = element_text(angle = 65, hjust = 1), 
            text = element_text(size = 15)) + 
      
      # Annotation
      annotate("text", x = 4, y = 0.55, 
               label = "Applied General", 
               size = 5)
  }
  if(cohort == "l2voc"){
    output <-  transform(as.data.frame(table(cut(df %>%
                                                   filter(T_INSCOPE_L2VOC_RET != 0) %>% 
                                                   .$PT_RETAINED_L2VOC_RET, 
                                                 breaks= breaks_retention, 
                                                 include.lowest = TRUE, 
                                                 right = FALSE))), # forces cut to ignore 
                         # values at top e.g. 95-100, counts to 99.9
                         Percentage = round(prop.table(Freq), 2)) %>% 
      ggplot(., aes(x = Var1, y = Percentage)) +
      geom_bar(stat = "identity", 
               position = position_nudge(x = 0.5), # position_nudge moves bars right
               
               # Formatting         
               fill =  retention_colours[5]
      ) + 
      scale_x_discrete("Retention Rate", 
                       labels = labels_retention) + 
      scale_y_continuous(name = "Percentage of schools & colleges", 
                         limits = c(0, 0.6), 
                         labels = scales::percent) +  
      theme_hc() + 
      theme(axis.text.x = element_text(angle = 65, hjust = 1), 
            text = element_text(size = 15)) + 
      
      # Annotation
      annotate("text", x = 4, y = 0.55, 
               label = "L2 Vocational", 
               size = 5)
  }  
  if(cohort == "techcert"){
    output <-  transform(as.data.frame(table(cut(df %>%
                                                   filter(T_INSCOPE_TECHCERT_RET != 0) %>% 
                                                   .$PT_RETAINED_TECHCERT_RET, 
                                                 breaks= breaks_retention, 
                                                 include.lowest = TRUE, 
                                                 right = FALSE))), # forces cut to ignore 
                         # values at top e.g. 95-100, counts to 99.9
                         Percentage = round(prop.table(Freq), 2)) %>% 
      ggplot(., aes(x = Var1, y = Percentage)) +
      geom_bar(stat = "identity", 
               position = position_nudge(x = 0.5), # position_nudge moves bars right
               
               # Formatting         
               fill =  retention_colours[6]
      ) + 
      scale_x_discrete("Retention Rate", 
                       labels = labels_retention) + 
      scale_y_continuous(name = "Percentage of schools & colleges", 
                         limits = c(0, 0.6), 
                         labels = scales::percent) +  
      theme_hc() + 
      theme(axis.text.x = element_text(angle = 65, hjust = 1), 
            text = element_text(size = 15)) + 
      
      # Annotation
      annotate("text", x = 4, y = 0.55, 
               label = "Technical certificates", 
               size = 5)
  } 
  
# Write plot to global environment
   assign(plot_name, output, envir = .GlobalEnv)
}
##----
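As an aside, the six if blocks above differ only in a column suffix, a label and a colour, so the whole dispatch could be driven from a small lookup table instead. A sketch (the table restates values from the function above; treat it as illustrative rather than a drop-in replacement):

```r
# Lookup table: one row per cohort, mirroring the labels and colours
# hard-coded in the if blocks above.
cohort_info <- data.frame(
  cohort = c("acad", "alev", "tlev", "agen", "l2voc", "techcert"),
  label  = c("Academic", "A level", "Tech level", "Applied General",
             "L2 Vocational", "Technical certificates"),
  colour = c("purple3", "sienna2", "palegreen3", "steelblue3",
             "blue4", "firebrick3"),
  stringsAsFactors = FALSE
)

# Inside the function, one code path then serves every cohort: the
# T_INSCOPE_*/PT_RETAINED_* column names are built from the cohort code.
info         <- cohort_info[cohort_info$cohort == "acad", ]
inscope_col  <- paste0("T_INSCOPE_",   toupper(info$cohort), "_RET")
retained_col <- paste0("PT_RETAINED_", toupper(info$cohort), "_RET")
```

With that in place, the body collapses to a single cut/table/ggplot pipeline using `df[[inscope_col]]`, `info$label` and `info$colour`. A related design choice: returning the plot with `return(output)`, rather than `assign()`-ing it into the global environment, makes the function easier to test and reuse.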

When it comes time to call it, my Retention script is much cleaner: instead of this relatively simple monstrosity cluttering everything up, I have

ret_cohortdist("RET_acad.institution.plot", RET_shadow.institution, "acad")
# Save to RDS
saveRDS(RET_acad.institution.plot, file = "Outputs/Retention/RETENTION_acad_institution.rds")
# Save to PDF 
ggsave("Outputs/Retention/RETENTION_acad_institution.pdf", plot = RET_acad.institution.plot, 
       width = 10, height = 8, dpi = 300, units = "in") 

From this, I can follow that I output the academic cohort, then save the RDS file and an output PDF to the Retention folder. This block is repeated six times, the only difference being swapping “acad” for e.g. “alev”, in a fraction of the lines compared with writing out the full calls six times. It’s clean, it’s logical, my inner child is at peace.
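Taken one step further, the six near-identical call-and-save blocks could themselves become a loop. A sketch, assuming ret_cohortdist() accepts all six cohort codes as documented above:

```r
# Generate and save every cohort plot in one pass. The cohort codes and
# file stems mirror the single "acad" example above.
cohorts <- c("acad", "alev", "agen", "tlev", "l2voc", "techcert")

for (cohort in cohorts) {
  plot_name <- paste0("RET_", cohort, ".institution.plot")
  out_stem  <- paste0("Outputs/Retention/RETENTION_", cohort, "_institution")

  ret_cohortdist(plot_name, RET_shadow.institution, cohort)
  saveRDS(get(plot_name), file = paste0(out_stem, ".rds"))
  ggsave(paste0(out_stem, ".pdf"), plot = get(plot_name),
         width = 10, height = 8, dpi = 300, units = "in")
}
```

The `get(plot_name)` calls are needed because the function assigns its result by name into the global environment.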

Complex function

Functions are also great when you’ve got something stupidly complex and long that is better contained as its own self-contained program.

Since I’ve stored my custom functions separately from the main Retention script, I can find the relevant bit of code immediately, avoiding scrolling through endless irrelevant file-saving or data-manipulation code. If someone wants to reproduce the output chart in their own separate bit of analysis, I can just send the function across and know it’s going to work. Alternatively, you might decide to keep useful functions used across analyses and projects in one separate folder (external to your project), and another for analysis scripts relevant to the project you’re working on.

I won’t go into too much detail on this function (that’s an exercise for the reader), but it’s long rather than overly complex. You can see that at the beginning I’ve again set up label formatting and selection, and again I assign the output to some plot_name value which I can call later on; this matters for loading it into my presentation.

#### Combined Retention measure - Retention measure by provider type ####
# Breaks out measure rates by institution type. Code for level 2 exists
# but not currently present in data for retained for 2nd year.

# plot_name = Character string, name of output plot in global environment.

# df_inst = Data frame set at institution level. Requires Institution flag, 
# Inscope and PT_Retained fields for ALL cohorts.

# df_nat = Data frame set at national level. Requires Inscope and PT_ versions of 
# fields for all measures/cohorts.

# Note: Independent schools and other schools are filtered as a matter of course. 
# If want those school data, then need to modify filtering section.
RET_combomeasures <- function(plot_name, df_inst, df_nat){
  
  # Formatting and colours
  institution_titles = c("MTD mainstream \n schools", "Sponsor-led \n academies"
                         , "Converter \n academies", "Mainstream \n free schools,\n UTCs & SS", 
                         "Independent \n schools", "Sixth form \n colleges", "Other \n FE colleges") 
  institution_cohorts = c("Academic", "A level", "Applied General", "Tech level", 
                          "L2 Vocational", "Technical Certificates")
  RETmeasure_colours = c("steelblue3", "firebrick3", "forestgreen")
  
  ## Set up national level data ready for feed in later
  RETmeasures_provtype_cohort_ntl <- 
    rbind(
      # Headline retention measure
      (
        df_nat %>% 
          mutate(Measure = "Headline retention",
                 Academic = sum(((PT_RETAINED_ACAD_RET/100.0)*
                                   T_INSCOPE_ACAD_RET)) / sum(T_INSCOPE_ACAD_RET), 
                 Alevel = sum(((PT_RETAINED_ALEV_RET/100.0)*
                                 T_INSCOPE_ALEV_RET)) / sum(T_INSCOPE_ALEV_RET),
                 AppliedGen = sum(((PT_RETAINED_AGEN_RET/100.0)*
                                     T_INSCOPE_AGEN_RET)) / sum(T_INSCOPE_AGEN_RET),
                 Techlevel = sum(((PT_RETAINED_TLEV_RET/100.0)*
                                    T_INSCOPE_TLEV_RET)) / sum(T_INSCOPE_TLEV_RET),
                 L2VOC = sum(((PT_RETAINED_L2VOC_RET/100.0)*
                                T_INSCOPE_L2VOC_RET)) / sum(T_INSCOPE_L2VOC_RET),
                 TechCert = sum(((PT_RETAINED_TECHCERT_RET/100.0)*
                                   T_INSCOPE_TECHCERT_RET)) / sum(T_INSCOPE_TECHCERT_RET)) %>% 
          select(Measure, Academic, Alevel, AppliedGen, Techlevel, L2VOC, TechCert) 
      )
      ,
      # Retained and Assessed
      (
        df_nat %>% 
          mutate(Measure = "Retained and Assessed", 
                 Academic = sum(((PT_RET_ASSESSED_ACAD/100.0)*
                                   T_INSCOPE_ACAD_RET)) / sum(T_INSCOPE_ACAD_RET), 
                 Alevel = sum(((PT_RET_ASSESSED_ALEV/100.0)*
                                 T_INSCOPE_ALEV_RET)) / sum(T_INSCOPE_ALEV_RET), 
                 AppliedGen = sum(((PT_RET_ASSESSED_AGEN/100.0)*
                                     T_INSCOPE_AGEN_RET)) / sum(T_INSCOPE_AGEN_RET), 
                 Techlevel = sum(((PT_RET_ASSESSED_TLEV/100.0)*
                                    T_INSCOPE_TLEV_RET)) / sum(T_INSCOPE_TLEV_RET), 
                 L2VOC = sum(((PT_RET_ASSESSED_L2VOC/100.0)*
                                T_INSCOPE_L2VOC_RET)) / sum(T_INSCOPE_L2VOC_RET), 
                 TechCert = sum(((PT_RET_ASSESSED_TECHCERT/100.0)*
                                   T_INSCOPE_TECHCERT_RET)) / sum(T_INSCOPE_TECHCERT_RET)) %>% 
          select(Measure, Academic, Alevel, AppliedGen, Techlevel, L2VOC, TechCert) 
      )
      ,
      # Retained for a 2nd Year
      (
        df_nat %>% 
          mutate(Measure = "Retained in the 2nd year",
                 Academic = sum(((PT_RETAINED_ACAD_2NDYR/100.0)*
                                   T_INSCOPE_ACAD_2NDYR)) / sum(T_INSCOPE_ACAD_2NDYR), 
                 Alevel = sum(((PT_RETAINED_ALEV_2NDYR/100.0)*
                                 T_INSCOPE_ALEV_2NDYR)) / sum(T_INSCOPE_ALEV_2NDYR),
                 AppliedGen = sum(((PT_RETAINED_AGEN_2NDYR/100.0)*
                                     T_INSCOPE_AGEN_2NDYR)) / sum(T_INSCOPE_AGEN_2NDYR),
                 Techlevel = sum(((PT_RETAINED_TLEV_2NDYR/100.0)*
                                    T_INSCOPE_TLEV_2NDYR)) / sum(T_INSCOPE_TLEV_2NDYR),
                 L2VOC = NaN, 
                 #sum(((PT_RETAINED_L2VOC_2NDYR/100.0)*
                 #T_INSCOPE_L2VOC_2NDYR)) / sum(T_INSCOPE_L2VOC_2NDYR),
                 TechCert = NaN) %>% 
          #sum(((PT_RETAINED_TECHCERT_2NDYR/100.0)*
          #T_INSCOPE_TECHCERT_2NDYR)) / sum(T_INSCOPE_TECHCERT_2NDYR)) %>% 
          select(Measure, Academic, Alevel, AppliedGen, Techlevel, L2VOC, TechCert) 
      )
    ) %>%  
    select(Measure, Academic, Alevel, AppliedGen, Techlevel, L2VOC, TechCert) %>% 
    melt(id = c("Measure")) %>% 
    setNames(c("Measure", "Cohort", "National state-funded"))
  
  # Plot the provider type/cohort retention rate breakdown
  # NOTE: SECTIONS WITH L2 COHORTS ARE COMMENTED OUT. ADD BACK IN IF USING 
  # L2 COHORTS!!
  output <- 
    rbind(
      # Headline retention measure
      (
        df_inst %>% 
          group_by(INSTITUTION) %>% 
          mutate(Measure = "Headline retention",
                 Academic = sum(((PT_RETAINED_ACAD_RET/100.0)*
                                   T_INSCOPE_ACAD_RET)) / sum(T_INSCOPE_ACAD_RET), 
                 Alevel = sum(((PT_RETAINED_ALEV_RET/100.0)*
                                 T_INSCOPE_ALEV_RET)) / sum(T_INSCOPE_ALEV_RET),
                 AppliedGen = sum(((PT_RETAINED_AGEN_RET/100.0)*
                                     T_INSCOPE_AGEN_RET)) / sum(T_INSCOPE_AGEN_RET),
                 Techlevel = sum(((PT_RETAINED_TLEV_RET/100.0)*
                                    T_INSCOPE_TLEV_RET)) / sum(T_INSCOPE_TLEV_RET),
                 L2VOC = sum(((PT_RETAINED_L2VOC_RET/100.0)*
                                T_INSCOPE_L2VOC_RET)) / sum(T_INSCOPE_L2VOC_RET),
                 TechCert = sum(((PT_RETAINED_TECHCERT_RET/100.0)*
                                   T_INSCOPE_TECHCERT_RET)) / sum(T_INSCOPE_TECHCERT_RET)) %>% 
          select(INSTITUTION, Measure, Academic, Alevel, AppliedGen, Techlevel, L2VOC, TechCert) %>% 
          ungroup() %>% 
          unique()
      )
      ,
      # Retained and Assessed
      (
        df_inst %>% 
          group_by(INSTITUTION) %>% 
          mutate(Measure = "Retained and Assessed", 
                 Academic = sum(((PT_RET_ASSESSED_ACAD/100.0)*
                                   T_INSCOPE_ACAD_RET)) / sum(T_INSCOPE_ACAD_RET), 
                 Alevel = sum(((PT_RET_ASSESSED_ALEV/100.0)*
                                 T_INSCOPE_ALEV_RET)) / sum(T_INSCOPE_ALEV_RET), 
                 AppliedGen = sum(((PT_RET_ASSESSED_AGEN/100.0)*
                                     T_INSCOPE_AGEN_RET)) / sum(T_INSCOPE_AGEN_RET), 
                 Techlevel = sum(((PT_RET_ASSESSED_TLEV/100.0)*
                                    T_INSCOPE_TLEV_RET)) / sum(T_INSCOPE_TLEV_RET), 
                 L2VOC = sum(((PT_RET_ASSESSED_L2VOC/100.0)*
                                T_INSCOPE_L2VOC_RET)) / sum(T_INSCOPE_L2VOC_RET), 
                 TechCert = sum(((PT_RET_ASSESSED_TECHCERT/100.0)*
                                   T_INSCOPE_TECHCERT_RET)) / sum(T_INSCOPE_TECHCERT_RET)) %>% 
          select(INSTITUTION, Measure, Academic, Alevel, AppliedGen, Techlevel, L2VOC, TechCert) %>% 
          ungroup() %>% 
          unique()
      )
      ,
      # Retained for a 2nd Year
      (
        df_inst %>% 
          group_by(INSTITUTION) %>% 
          mutate(Measure = "Retained in the 2nd year",
                 Academic = sum(((PT_RETAINED_ACAD_2NDYR/100.0)*
                                   T_INSCOPE_ACAD_2NDYR)) / sum(T_INSCOPE_ACAD_2NDYR), 
                 Alevel = sum(((PT_RETAINED_ALEV_2NDYR/100.0)*
                                 T_INSCOPE_ALEV_2NDYR)) / sum(T_INSCOPE_ALEV_2NDYR),
                 AppliedGen = sum(((PT_RETAINED_AGEN_2NDYR/100.0)*
                                     T_INSCOPE_AGEN_2NDYR)) / sum(T_INSCOPE_AGEN_2NDYR),
                 Techlevel = sum(((PT_RETAINED_TLEV_2NDYR/100.0)*
                                    T_INSCOPE_TLEV_2NDYR)) / sum(T_INSCOPE_TLEV_2NDYR),
                 L2VOC = NaN, 
                 #sum(((PT_RETAINED_L2VOC_2NDYR/100.0)*
                 #T_INSCOPE_L2VOC_2NDYR)) / sum(T_INSCOPE_L2VOC_2NDYR),
                 TechCert = NaN) %>% 
          #sum(((PT_RETAINED_TECHCERT_2NDYR/100.0)*
          #T_INSCOPE_TECHCERT_2NDYR)) / sum(T_INSCOPE_TECHCERT_2NDYR)) %>% 
          select(INSTITUTION, Measure, Academic, Alevel, AppliedGen, Techlevel, L2VOC, TechCert) %>% 
          ungroup() %>% 
          unique()
      )
    ) %>% 
    # Look up the display title from the institution code
    # (codes outside 1-7 become NA)
    mutate(INSTITUTION_TYPE2 = institution_titles[match(INSTITUTION, 1:7)]) %>% 
    mutate(INSTITUTION_TYPE2 = fct_reorder(INSTITUTION_TYPE2, INSTITUTION, .desc = FALSE)) %>% 
    select(INSTITUTION_TYPE2, Measure, Academic, Alevel, 
           AppliedGen, Techlevel, L2VOC, TechCert) %>% 
    melt(id=c("INSTITUTION_TYPE2", "Measure")) %>% 
    setNames(c("Institution type", "Measure", "Cohort", "RET")) %>% 
    
    # Filters to remove/include Independent schools/Other schools (comment out if NA)
    filter(`Institution type` %in% c("MTD mainstream \n schools", "Sponsor-led \n academies", 
                                     "Converter \n academies", "Mainstream \n free schools,\n UTCs & SS", 
                                     "Sixth form \n colleges", "Other \n FE colleges")) %>% 
    
    # Filters to remove A level and Tech certs since are subsets of Academic and L2 Voc) 
    filter(Cohort %nin% c("Alevel", "TechCert")) %>% 
    
    # Add national data to the dataset (to be overlaid later) and change cohort variable to final value
    left_join(., RETmeasures_provtype_cohort_ntl, by = c("Measure", "Cohort")) %>% 
    
    # Change cohort variable to final text and set ordering
    mutate(Cohort = dplyr::recode(as.character(Cohort),
                                  Academic   = "Academic",
                                  Alevel     = "A level",
                                  AppliedGen = "Applied General",
                                  Techlevel  = "Tech level",
                                  L2VOC      = "L2 Vocational",
                                  TechCert   = "Technical Certificates",
                                  .default   = NA_character_)) %>% 
    mutate(Cohort = factor(Cohort, levels = institution_cohorts)) %>% 
    
    # Set up plot and source data for cohorts
    ggplot(., aes(x = `Institution type`, y = RET, fill = Measure)) +
    geom_bar(stat = "identity", 
             position = position_dodge(width = 0.5) # position_dodge separates the bars for each measure
    ) +
    
    # Overlay national level data lines now
    geom_hline(aes(yintercept = `National state-funded`, color = Measure)) +
    
    # Formatting chart, text and legend. 
    scale_fill_manual(values = RETmeasure_colours) +
    scale_colour_manual(values = RETmeasure_colours) +
    scale_x_discrete("Institution type") + 
    scale_y_continuous(name = "Percentage of students", 
                       limits = c(0, 1), 
                       labels = scales::percent, 
                       breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1)) + 
    theme_hc() + 
    
    theme(text = element_text(size = 20), 
          axis.text.x = element_text(size = 15), 
          axis.text.y = element_text(size = 20)) +
    
    theme(legend.position="top") +
    theme(legend.text = element_text(size=20), 
          legend.title=element_text(size=20)) +
    
    # Set up facets for each cohort
    facet_wrap(~ Cohort, ncol = 1)
  
  # Write plot to global environment
  assign(plot_name, output, envir = .GlobalEnv)
  
}

For those of you interested, the output plot looks like this. (It could still do with a bit more polishing, but just imagine trying to build this in Excel!)

Hurry, hurry, hurry! Step right up folks, get your slide packs right here

One of the final hurdles of any piece of analysis is disseminating it, presenting it and generating interest. I’d previously seen HTML documents written in R Markdown, but hadn’t seen something that outputs a series of slides as part of one big pipeline. RStudio and R Markdown have several options for producing a slide show:

  • Beamer: PDF presentations
  • ioslides, slidy: HTML presentations
  • xaringan: an R Markdown extension for HTML presentations; very pretty, and allows gifs and interactive elements within slides.
  • PowerPoint: PowerPoint presentations
  • Rpres: RStudio integrated presentations
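For the R Markdown-based formats, the choice is made in the YAML header of the .Rmd file. A minimal ioslides deck, for example, might start like this (titles here are made up):

```markdown
---
title: "Retention measures"
output: ioslides_presentation
---

## First slide

- A bullet point
```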

For this, I have used the Rpres format, which is built into RStudio and allows you to preview presentations within the IDE using a straightforward Markdown authoring syntax. To create a new R presentation, you execute the New File -> R Presentation command:

The very first slide in the deck is the title slide, which gets a special background colour and a larger heading. As you can see below, the syntax is just Markdown, with each slide delimited by a section heading underlined with =========. This is quite nice if you’ve only just learned R Markdown (or are still learning it) and don’t need yet another thing to pick up in time for a presentation!
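For illustration, a minimal .Rpres file looks something like this (the titles and fields are invented):

```
My Presentation
========================================================
author: An Analyst
date: 2019

First slide
========================================================

- A bullet point
- Another bullet point
```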

To update the preview, you click Preview (which replaces the usual ‘Knit’) and the presentation is produced. In the Presentation window, selecting More -> Save as Web Page generates a final, standalone version of the slide pack.

This will output a ‘good enough’ slide pack with some basic formatting right out of the box. You can add extras like transition styles and speeds, slide navigation menus, and incremental display of slide elements if you want to hold off on the big reveal… However, it’s worth mentioning that we can improve the look of our presentations further by using CSS, and that’s not limited to .Rpres files: it applies to any HTML output, whether an R Markdown document or slides built with e.g. xaringan.

CSS (or Cascading Style Sheets) is one of the core languages of the internet, being responsible for the visual layout of webpages. There are several benefits beyond making it look good when using CSS in our presentations:

  • Consistency - You only have to specify these details once for any element. So for example, if I want tables to look a certain way, I would define CSS which would force all tables within my document to appear the same.
  • Easier to maintain and update - To change the style of an element once across the entire slide pack, I only have to make an edit in one place. The bigger the slide pack, the bigger the time saving.
  • Superior styles - All presentation packages will have a specific style or theme. By applying CSS, you can make tweaks to personalise outputs to you, your team/area etc. This applies to R markdown HTML documents too (this document is written with the basic sandstone theme).
  • Exact positioning - Style sheets can position elements wherever you want them, defining column sizing/layouts or image positioning for example.

With minimal CSS, the difference can be quite noticeable:

In addition to the obvious differences in text formatting, the standard formatting and layout options don’t provide enough control to place the Department for Education logo in the corner, as we have in the post-CSS version.

Adding/modifying CSS can be done in two ways - either directly into your .Rpres document, or referenced in a separate CSS file stored in the same directory as your presentation file. Here I have introduced it directly in my .Rpres file:
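Either route only takes a few lines. As a rough sketch (the file name and colour are illustrative), CSS can sit in a style block near the top of the .Rpres file, or be pointed to with a css: field on the title slide:

```
My Presentation
========================================================
css: custom.css

<style>
/* Only the elements named here are overridden */
h1 { color: #104f75; }
</style>
```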

Only elements that you reference will be modified, otherwise the existing style for a particular element (e.g. main title slide header) will remain.

You can get a copy of the default CSS style sheet used in RStudio here, so feel free to extract chunks into your presentation and change values and see what impact it has!

Write it, cut it, paste it, save it… or just automate it

A reproducible pipeline should have minimal input after the first iteration, potentially just to add additional functionality or make changes to source data, or because someone wants a bar a certain colour…

To accomplish this, I added a Master script which sources all the various components of the analysis:
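A master script in this style is essentially a sequence of source() calls; the file names below are invented just to show the shape:

```r
# master.R -- hypothetical master script; file names are illustrative
source("01_get_data.R")    # connect to SQL and pull source tables
source("02_tidy_data.R")   # clean the data and derive the measures
source("03_functions.R")   # load the generalised plotting functions
source("04_make_plots.R")  # build every chart into the environment
# The .Rpres deck is then rendered separately in RStudio
```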

I can choose to run each script sequentially, or select everything and run it from scratch. As mentioned before, if I hand over the entire project with just the source code, as long as the person has access to the same tables as me, they will be able to recreate the same plots.

In addition, I save my current workspace at the end of the scripts; in the master file I first remove intermediate tables and plots so that only the finished objects are kept. By design, objects created outside the .Rpres document are not recognised, because a separate R session is used to render the document. Consequently, I have to load the workspace back in when generating the final presentation, but this is still a lot easier than pasting all my analysis into a PowerPoint presentation and fiddling with positioning and formatting.
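The save/load pair itself is just base R; the object and file names below are stand-ins for the real analysis objects:

```r
# Stand-ins for real analysis objects
raw_extract <- data.frame(id = 1:3)          # intermediate -- not needed by the deck
plot_data <- data.frame(pct = c(0.8, 0.9))   # finished object the deck uses

# End of the master script: drop intermediates, then save what remains
rm(raw_extract)
save.image("analysis_workspace.RData")

# First chunk of the .Rpres document (it renders in a fresh R session):
# load("analysis_workspace.RData")
```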

Conclusions

I’ve shown you how to bring together R projects, functions, and a final presentation in one big piece of work that you can run in a very short time. With this process in place, I can go from nothing to a full set of completed analysis, plots, and a presentation in about 20 seconds. Rather than sharing loads of PDFs with analytical colleagues, I can share my scripts, and they can run them for themselves and see how things work, getting a more rounded view of all the analysis if they want it. Yes, it takes time to set up and learn, and no, you won’t need an entire R project workflow for every single piece of analysis you carry out (unless you really want to).

However, when you’re creating something that will probably be updated next year, or that brings together lots of interlinked analysis, the benefits of a one-click-and-done approach (or 90% of one), combined with reusable, generalised functions whose code is clearly separated from the main workflow, will save you a lot of time and headaches!

I hope this has been helpful!

FIN